Abstract

We develop a learning principle and an efficient algorithm for batch
learning from logged bandit feedback. This learning setting is ubiquitous
in online systems (e.g., ad placement, web search, recommendation), where
an algorithm makes a prediction (e.g., ad ranking) for a given input
(e.g., query) and observes bandit feedback (e.g., user clicks on presented
ads). We first address the counterfactual nature of the learning problem
through propensity scoring. Next, we prove generalization error bounds
that account for the variance of the propensity-weighted empirical risk
estimator. These constructive bounds give rise to the Counterfactual Risk
Minimization (CRM) principle. We show how CRM can be used to derive a new
learning method - called Policy Optimizer for Exponential Models (POEM) -
for learning stochastic linear rules for structured output prediction. We
present a decomposition of the POEM objective that enables efficient
stochastic gradient optimization. POEM is evaluated on several multi-label
classification problems showing substantially improved robustness and
generalization performance compared to the state-of-the-art.


1. Introduction 

Log data is one of the most ubiquitous forms of data available, as it can
be recorded from a variety of systems (e.g., search engines, recommender
systems, ad placement) at little cost. The interaction logs of such
systems typically contain a record of the input to the system (e.g.,
features describing the user), the prediction made by the system (e.g., a
recommended list of news articles) and the feedback (e.g., number of
ranked articles the user read) (Li et al., 2010). The feedback, however,
provides only partial information - "bandit feedback" - limited to the
particular prediction shown by the system. The feedback for all the other
predictions the system could have made is typically not known. This makes
learning from log data fundamentally different from supervised learning,
where "correct" predictions (e.g., the best ranking of news articles for
that user) together with a loss function provide full-information
feedback.

We study the problem of batch learning from logged bandit feedback. Unlike
online learning with bandit feedback, batch learning does not require
interactive experimental control over the system. Furthermore, it enables
the reuse of existing data and offline cross-validation techniques for
model selection (e.g., "should we perform feature selection?", "which
learning algorithm to use?", etc.).

To solve this batch-learning problem, we first need a counterfactual
estimator (Bottou et al., 2013) of a system's performance, so that we can
estimate how other systems would have performed if they had been in
control of choosing predictions. Such estimators have been developed
recently for the off-policy evaluation problem (Langford et al., 2011),
(Li et al., 2011), (Li et al., 2014), where data collected from the
interaction logs of one bandit algorithm is used to evaluate another
system.

Our approach to counterfactual learning centers around the insight that,
to perform robust learning, it is not sufficient to have just an unbiased
estimator of the off-policy system's performance. We must also reason
about how the variances of these estimators differ across the hypothesis
space, and pick the hypothesis that has the best possible guarantee
(tightest conservative bound) for its performance. We first prove
generalization error bounds analogous to structural risk minimization
(Vapnik, 1998) for a stochastic hypothesis family using an empirical
Bernstein argument (Maurer & Pontil, 2009). The constructive nature of
these bounds suggests a general principle - Counterfactual Risk
Minimization (CRM) - for designing methods for batch learning from bandit
feedback.

Using the CRM principle, we derive a new learning algorithm - Policy
Optimizer for Exponential Models (POEM) - for structured output
prediction. The training objective is decomposed using repeated variance
linearization, and optimizing it using AdaGrad (Duchi et al., 2011) yields
a fast and effective algorithm. We evaluate POEM on several multi-label
classification problems, verify that its empirical performance supports
the theory, and demonstrate substantial improvement in generalization
performance over the state-of-the-art.

We review existing approaches in Section 2. The learning setting is
detailed in Section 3, and contrasted with supervised learning. In Section
4, we derive the Counterfactual Risk Minimization learning principle and
provide a rule of thumb for setting hyper-parameters. In Section 5, we
instantiate the CRM principle for structured output prediction using
exponential models and construct an efficient decomposition of the
objective for stochastic optimization. Empirical evaluations are reported
in Section 6 and we conclude with future directions and discussion in
Section 7.

2. Related Work 

Existing approaches for batch learning from logged bandit feedback fall
into two categories. The first approach is to reduce the problem to
supervised learning. In principle, since the logs give us an incomplete
view of the feedback for different predictions, one could first use
regression to estimate a feedback oracle for unseen predictions, and then
use any supervised learning algorithm using this feedback oracle. Such a
two-stage approach is known to not generalize well (Beygelzimer &
Langford, 2009). More sophisticated techniques using a cost weighted
classification (Zadrozny et al., 2003) or the Offset Tree algorithm
(Beygelzimer & Langford, 2009) allow us to perform batch learning when the
space of possible predictions is small. In contrast, our approach
generalizes structured output prediction, with exponential-sized
prediction spaces.

The second approach to batch learning from bandit feedback uses propensity
scoring (Rosenbaum & Rubin, 1983) to derive unbiased estimators from the
interaction logs (Bottou et al., 2013). These estimators are used for a
small set of candidate policies, and the best estimated candidate is
picked via exhaustive search. In contrast, our approach can be optimized
via gradient descent, over hypothesis families (of infinite size) that are
as expressive as those used in supervised learning.

Our approach builds on counterfactual estimators that have been developed
for off-policy evaluation. The inverse propensity scoring estimator can be
optimal when we have a good model of the historical algorithm (Strehl et
al., 2010), (Li et al., 2014), (Li et al., 2015), and doubly robust
estimators are even more efficient when we additionally have a good model
of the feedback (Langford et al., 2011). In our work, we focus on the
inverse propensity scoring estimator, but the results we derive hold
equally for the doubly robust estimators. Recent work (Thomas et al.,
2015) has additionally developed tighter confidence bounds for
counterfactual estimators, which can be directly co-opted in our approach
to counterfactual learning.

In the current work, we concentrate on the case where 
the historical algorithm was a stationary, stochastic policy. 
Techniques like exploration scavenging (Langford et al., 
2008) and bootstrapping (Mary et al., 2014) allow us to 
perform counterfactual evaluation even when the historical 
algorithm was deterministic or adaptive. 

Our strategy of picking the hypothesis with the tightest conservative
bound on performance mimics similar successful approaches in other
problems like supervised learning (Vapnik, 1998), risk averse multi-armed
bandits (Galichet et al., 2013), regret minimizing contextual bandits
(Langford & Zhang, 2008) and reinforcement learning (Garcia & Fernandez,
2012).

Beyond the problem of batch learning from bandit feedback, our approach
can have implications for several applications that require learning from
logged bandit feedback data: warm-starting multi-armed bandits
(Shivaswamy & Joachims, 2012) and contextual bandits (Strehl et al.,
2010), pre-selecting retrieval functions for search engines (Hofmann et
al., 2013), and policy evaluation for contextual bandits (Li et al.,
2011), to name a few.

3. Learning Setting: Batch Learning with 
Logged Bandit Feedback 

Consider a structured output prediction problem that takes as input
$x \in \mathcal{X}$ and outputs a prediction $y \in \mathcal{Y}$. For
example, in multi-label document classification, $x$ could be a news
article and $y$ a bitvector indicating the labels assigned to this
article. The inputs are assumed drawn from a fixed but unknown
distribution $\Pr(\mathcal{X})$, $x \sim \Pr(\mathcal{X})$. Consider a
hypothesis space $\mathcal{H}$ of stochastic policies. A hypothesis
$h(\mathcal{Y} \mid x) \in \mathcal{H}$ defines a probability distribution
over the output space $\mathcal{Y}$, and the hypothesis makes predictions
by sampling, $y \sim h(\mathcal{Y} \mid x)$. Note that this definition
also includes deterministic hypotheses, where the distributions assign
probability 1 to a single $y$. For notational convenience, denote
$h(\mathcal{Y} \mid x)$ by $h(x)$, and the probability assigned by $h(x)$
to $y$ as $h(y \mid x)$.

In interactive learning systems, we only observe feedback $\delta(x, y)$
for the $y$ sampled from $h(x)$. In this work, feedback
$\delta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a cardinal
loss that is only observed at the sampled data points. Small values for
$\delta(x, y)$ indicate user satisfaction with $y$ for $x$, while large
values indicate dissatisfaction. The expected loss - called risk - of a
hypothesis $R(h)$ is defined as,

$$R(h) = \mathbb{E}_{x \sim \Pr(\mathcal{X})} \mathbb{E}_{y \sim h(x)} [\delta(x, y)].$$

The goal of the system is to minimize risk, or equivalently, maximize
expected user satisfaction. The aim of learning is to find a hypothesis
$h \in \mathcal{H}$ that has minimum risk.

We wish to re-use the interaction logs of these systems for batch
learning. Assume that its historical algorithm acted according to a
stationary policy $h_0(x)$ (also called logging policy). The data
collected from this system is

$$\mathcal{D} = \{(x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n)\},$$

where $y_i \sim h_0(x_i)$ and $\delta_i \equiv \delta(x_i, y_i)$.

Sampling bias. $\mathcal{D}$ cannot be used to estimate $R(h)$ for a new
hypothesis $h$ using the estimator typically used in supervised learning.
We ideally need either full information about $\delta(x_i, \cdot)$ or need
samples $y \sim h(x_i)$ to directly estimate $R(h)$. This explains why, in
practice, model selection over a small set of candidate systems is
typically done via A/B tests, where the candidates are deployed to collect
new data sampled according to $y \sim h(x)$ for each hypothesis $h$. A
relative comparison of the assumptions, hypotheses, and principles used in
supervised learning vs. our learning setting is outlined in Table 1.
Fundamentally, batch learning with bandit feedback is hard because
$\mathcal{D}$ is both biased (predictions favored by the historical
algorithm will be over-represented) and incomplete (feedback for other
predictions will not be available) for learning.

Table 1. Comparison of assumptions, hypotheses and learning principles for supervised learning and batch learning with bandit feedback.

| Setting | Distribution | Data, $\mathcal{D}$ | Hypothesis, $h$ | Loss | Learning principle |
|---|---|---|---|---|---|
| Supervised | $(x, y^*) \sim \Pr(\mathcal{X} \times \mathcal{Y})$ | $\{x_i, y_i^*\}$ | $y = h(x)$ | $\Delta(y^*, \cdot)$ known | $\operatorname{argmin}_h \hat{R}(h) + C \cdot Reg(\mathcal{H})$ |
| Batch w/bandit | $x \sim \Pr(\mathcal{X})$, $y \sim h_0(x)$ | $\{x_i, y_i, \delta_i, p_i\}$ | $y \sim h(\mathcal{Y} \mid x)$ | $\delta(x, \cdot)$ unknown | $\operatorname{argmin}_h \hat{R}^M(h) + \lambda \sqrt{Var_h(u)/n}$ |

4. Learning Principle: Counterfactual Risk Minimization

The distribution mismatch between $h_0$ and any hypothesis
$h \in \mathcal{H}$ can be addressed using importance sampling, which
corrects the sampling bias as:

$$R(h) = \mathbb{E}_{x \sim \Pr(\mathcal{X})} \mathbb{E}_{y \sim h(x)} [\delta(x, y)] = \mathbb{E}_{x \sim \Pr(\mathcal{X})} \mathbb{E}_{y \sim h_0(x)} \left[ \delta(x, y) \frac{h(y \mid x)}{h_0(y \mid x)} \right].$$

This motivates the propensity scoring approach (Rosenbaum & Rubin, 1983).
During the operation of the logging policy, we keep track of the
propensity, $h_0(y \mid x)$, of the historical system to generate $y$ for
$x$. From these propensity-augmented logs

$$\mathcal{D} = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\},$$

where $p_i \equiv h_0(y_i \mid x_i)$, we can derive an unbiased estimate of
$R(h)$ via Monte Carlo approximation,

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{h(y_i \mid x_i)}{p_i}. \qquad (1)$$

At first thought, one may think that directly estimating $\hat{R}(h)$ over
$h \in \mathcal{H}$ and picking the empirical minimizer is a valid
learning strategy. Unfortunately, there are several potential pitfalls.

First, this strategy is not invariant to additive transformations of the
loss and will give degenerate results if the loss is not appropriately
scaled. In Section 4.1, we develop intuition for why this is so, and
derive the optimal scaling of $\delta$. For now, assume that
$\forall x, \forall y, \delta(x, y) \in [-1, 0]$.

Second, this estimator has unbounded variance, since $p_i \simeq 0$ in
$\mathcal{D}$ can cause $\hat{R}(h)$ to be arbitrarily far away from the
true risk $R(h)$. This problem can be fixed by "clipping" the importance
sampling weights (Ionides, 2008)

$$R^M(h) = \mathbb{E}_{x \sim \Pr(\mathcal{X})} \mathbb{E}_{y \sim h_0(x)} \left[ \delta(x, y) \min\left\{ M, \frac{h(y \mid x)}{h_0(y \mid x)} \right\} \right],$$

$$\hat{R}^M(h) = \frac{1}{n} \sum_{i=1}^{n} \delta_i \min\left\{ M, \frac{h(y_i \mid x_i)}{p_i} \right\}.$$

$M > 0$ is a hyper-parameter chosen to trade-off bias and variance in the
estimate, where smaller values of $M$ induce larger bias in the estimate.
Optimizing $\hat{R}^M(h)$ through exhaustive enumeration over
$\mathcal{H}$ yields the Inverse Propensity Scoring (IPS) training
objective (Bottou et al., 2013)

$$\hat{h}^{IPS} = \operatorname{argmin}_{h \in \mathcal{H}} \left\{ \hat{R}^M(h) \right\}. \qquad (2)$$

Third, importance sampling typically estimates $\hat{R}^M(h)$ of different
hypotheses $h \in \mathcal{H}$ with vastly different variances. Consider
two hypotheses $h_1$ and $h_2$, where $h_1$ is similar to $h_0$, but where
$h_2$ samples predictions that were not well explored by $h_0$. Importance
sampling gives us low-variance estimates for $\hat{R}^M(h_1)$, but highly
variable estimates for $\hat{R}^M(h_2)$. Intuitively, if we can develop
variance-sensitive confidence bounds over the hypothesis space, optimizing
a conservative confidence bound should find an $h$ whose $R(h)$ will not
be much worse, with high probability.
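The clipped propensity-weighted estimate described above can be sketched in a few lines. This is a minimal illustration, assuming the logged losses are already scaled to $[-1, 0]$ and that the propensities are strictly positive; the function name `clipped_ips_risk` and its argument names are hypothetical, not taken from the POEM software.

```python
from typing import Sequence


def clipped_ips_risk(
    losses: Sequence[float],
    new_probs: Sequence[float],
    propensities: Sequence[float],
    clip: float,
) -> float:
    """Clipped inverse-propensity estimate of the risk of a new policy h.

    losses[i]       : delta_i, the observed loss (assumed scaled to [-1, 0])
    new_probs[i]    : h(y_i | x_i), probability h assigns to the logged y_i
    propensities[i] : p_i = h_0(y_i | x_i) > 0, the logged propensity
    clip            : M > 0, the cap on each importance weight
    """
    n = len(losses)
    total = 0.0
    for delta, h_prob, p in zip(losses, new_probs, propensities):
        weight = min(clip, h_prob / p)  # clipped importance weight
        total += delta * weight
    return total / n
```

Evaluating the logging policy itself (so that `new_probs` equals `propensities`) with any `clip >= 1` simply returns the average logged loss, which is a quick sanity check on a real log.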

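Generalization error bound. A standard analysis would give a bound that is
agnostic to variance introduced by importance sampling. Following our
intuition above, we derive a higher order bound that includes the variance
term using empirical Bernstein bounds (Maurer & Pontil, 2009). To develop
such a generalization error bound, we first need a concept of capacity for
stochastic hypothesis classes. For any stochastic class $\mathcal{H}$,
define an auxiliary function class
$\mathcal{F}_{\mathcal{H}} = \{f_h : \mathcal{X} \times \mathcal{Y} \to [0, 1]\}$.
Each $h \in \mathcal{H}$ corresponds to a function
$f_h \in \mathcal{F}_{\mathcal{H}}$,

$$f_h(x, y) = 1 + \frac{\delta(x, y)}{M} \min\left\{ M, \frac{h(y \mid x)}{h_0(y \mid x)} \right\}. \qquad (3)$$

$f_h$ is a deterministic, bounded function, and satisfies

$$\mathbb{E}_{x} \mathbb{E}_{y \sim h_0(x)} [f_h(x, y)] = 1 + R^M(h)/M. \qquad (4)$$

Hence, we can use classic notions of capacity for
$\mathcal{F}_{\mathcal{H}}$ to reason about the convergence of
$\hat{R}^M(h) \to R^M(h)$.

Recall the covering number $\mathcal{N}_\infty(\epsilon, \mathcal{F}, n)$
for a function class $\mathcal{F}$ (refer (Anthony & Bartlett, 2009),
(Maurer & Pontil, 2009) and the references therein). Define an
$\epsilon$-cover $\mathcal{N}(\epsilon, A, \|\cdot\|_\infty)$ for a set
$A \subseteq \mathbb{R}^n$ to be the size of the smallest cardinality
subset $A_0 \subseteq A$ such that $A$ is contained in the union of balls
of radius $\epsilon$ centered at points in $A_0$, in the metric induced by
$\|\cdot\|_\infty$. The covering number is,

$$\mathcal{N}_\infty(\epsilon, \mathcal{F}, n) = \sup_{(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y})^n} \mathcal{N}(\epsilon, \mathcal{F}(\{(x_i, y_i)\}), \|\cdot\|_\infty),$$

where $\mathcal{F}(\{(x_i, y_i)\})$ is the function class conditioned on
sample $\{(x_i, y_i)\}$,

$$\mathcal{F}(\{(x_i, y_i)\}) = \{(f(x_1, y_1), \ldots, f(x_n, y_n)) : f \in \mathcal{F}\}.$$

Our measure for the capacity of our stochastic class $\mathcal{H}$ to
"fit" a sample of size $n$ shall be
$\mathcal{N}_\infty(\frac{1}{n}, \mathcal{F}_{\mathcal{H}}, 2n)$.

Theorem 1. For a compact notation, define

$$u_h^i \equiv \delta_i \min\{M, h(y_i \mid x_i)/p_i\}, \qquad \bar{u}_h \equiv \sum_{i=1}^{n} u_h^i / n,$$

$$Var_h(u) \equiv \sum_{i=1}^{n} (u_h^i - \bar{u}_h)^2 / (n - 1),$$

$$Q_{\mathcal{H}}(n, \gamma) \equiv \log(10 \cdot \mathcal{N}_\infty(\tfrac{1}{n}, \mathcal{F}_{\mathcal{H}}, 2n) / \gamma), \qquad 0 < \gamma < 1.$$

With probability at least $1 - \gamma$ in the random vector
$(x_1, y_1) \cdots (x_n, y_n)$, with $x_i \sim \Pr(\mathcal{X})$ and
$y_i \sim h_0(x_i)$, and observed losses $\delta_1, \ldots, \delta_n$, for
$n \ge 16$ and a stochastic hypothesis space $\mathcal{H}$ with capacity
$\mathcal{N}_\infty(\frac{1}{n}, \mathcal{F}_{\mathcal{H}}, 2n)$,

$$\forall h \in \mathcal{H} : \quad R(h) \le \hat{R}^M(h) + \sqrt{18 \, Var_h(u) \, Q_{\mathcal{H}}(n, \gamma) / n} + M \cdot 15 \, Q_{\mathcal{H}}(n, \gamma) / (n - 1).$$

Proof. Follow the proof of Theorem 6 of (Maurer & Pontil, 2009) with the
function class as $\mathcal{F}_{\mathcal{H}}$. Use Equations (3), (4) to
translate from $f_h(x, y)$ to $R^M(h)$: writing $\bar{f}_h$ for the sample
mean of $f_h$, we have $R^M(h) = M (\mathbb{E}[f_h] - 1)$,
$\hat{R}^M(h) = M (\bar{f}_h - 1)$, and $Var_h(u) = M^2 \, Var(f_h)$.
Finally, since $\delta(\cdot, \cdot) \le 0$, we have
$R(h) \le R^M(h)$. □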

CRM Principle. This generalization error bound is constructive, and it
motivates a general principle for designing machine learning methods for
batch learning from bandit feedback. In particular, a learning algorithm
following this principle should jointly optimize the estimate
$\hat{R}^M(h)$ as well as its empirical standard deviation, where the
latter serves as a data-dependent regularizer.

$$\hat{h}^{CRM} = \operatorname{argmin}_{h \in \mathcal{H}} \left\{ \hat{R}^M(h) + \lambda \sqrt{\frac{Var_h(u)}{n}} \right\}. \qquad (5)$$

$M > 0$ and $\lambda \ge 0$ are regularization hyper-parameters. When
$\lambda = 0$, we recover the Inverse Propensity Scoring objective of
Equation (2). In analogy to Structural Risk Minimization (Vapnik, 1998),
we call this principle Counterfactual Risk Minimization, since both pick
the hypothesis with the tightest upper bound on the true risk $R(h)$.
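To make the variance-regularized objective concrete, the following sketch evaluates it for one fixed hypothesis from logged data. It assumes the same inputs as a clipped IPS estimate (scaled losses, the new policy's probabilities for the logged actions, and the logged propensities) and at least two samples; `crm_objective` is an illustrative name, not part of any released code.

```python
import math
from typing import Sequence


def crm_objective(
    losses: Sequence[float],
    new_probs: Sequence[float],
    propensities: Sequence[float],
    clip: float,
    lam: float,
) -> float:
    """Clipped IPS risk plus lam * sqrt(sample variance / n).

    Requires n >= 2 so that the (n - 1)-normalized variance is defined.
    """
    n = len(losses)
    # u_i = delta_i * min(M, h(y_i | x_i) / p_i)
    u = [d * min(clip, h / p) for d, h, p in zip(losses, new_probs, propensities)]
    mean_u = sum(u) / n
    var_u = sum((ui - mean_u) ** 2 for ui in u) / (n - 1)
    return mean_u + lam * math.sqrt(var_u / n)
```

With `lam = 0.0` the value reduces to the clipped IPS estimate; increasing `lam` penalizes hypotheses whose weighted losses vary widely across the log.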


4.1. Optimal Loss Scaling 


When performing supervised learning with true labels $y^*$ and a loss
function $\Delta(y^*, \cdot)$, empirical risk minimization using the
standard estimator is invariant to additive translation and multiplicative
scaling of $\Delta$. The risk estimators $\hat{R}(h)$ and $\hat{R}^M(h)$
in bandit learning, however, crucially require
$\delta(\cdot, \cdot) \in [-1, 0]$.

Consider, for example, the case of $\delta(\cdot, \cdot) \ge 0$. The
training objectives in Equation (2) (IPS) and Equation (5) (CRM) become
degenerate! A hypothesis $h \in \mathcal{H}$ that completely avoids the
sample $\mathcal{D}$ (i.e. $\forall i = 1, \ldots, n$,
$h(y_i \mid x_i) = 0$) trivially achieves the best possible
$\hat{R}^M(h)$ (= 0) with 0 empirical variance. This degeneracy arises
because when $\delta(\cdot, \cdot) \ge 0$, the optimization objectives are
a lower bound on $R(h)$, whereas what we need is an upper bound.

For any bounded loss $\delta(\cdot, \cdot) \in [\triangledown, \triangle]$,
we have, $\forall x$,

$$\mathbb{E}_{y \sim h(x)} [\delta(x, y)] \le \triangle + \mathbb{E}_{y \sim h_0(x)} \left[ (\delta(x, y) - \triangle) \frac{h(y \mid x)}{h_0(y \mid x)} \right].$$

We assert that this is the tightest upper bound possible without
additional assumptions. Since the optimization objectives in Equations
(2), (5) are unaffected by a constant scale factor (e.g.,
$\triangle - \triangledown$), we should transform $\delta \mapsto \delta'$
to derive a conservative training objective w.r.t. $\delta'$,

$$\delta' \equiv \{\delta - \triangle\} / \{\triangle - \triangledown\}.$$
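The loss transformation is a one-line affine map; a small sketch follows, assuming the lower and upper bounds of the raw loss are known in advance. The helper name `scale_loss` is hypothetical.

```python
def scale_loss(delta: float, lower: float, upper: float) -> float:
    """Map a raw loss in [lower, upper] to [-1, 0].

    Implements delta' = (delta - upper) / (upper - lower), so the worst
    raw loss maps to 0 and the best raw loss maps to -1.
    """
    if upper <= lower:
        raise ValueError("upper bound must be strictly greater than lower bound")
    return (delta - upper) / (upper - lower)
```

For instance, a Hamming loss over 6 labels lies in $[0, 6]$: a perfect prediction (loss 0) maps to -1, while a prediction that gets every label wrong (loss 6) maps to 0.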


4.2. Selecting hyper-parameters 


We propose selecting the hyper-parameters $M > 0$ and $\lambda \ge 0$ via
validation. However, we must be careful not to set $M$ too small or
$\lambda$ too big. The estimated risk $\hat{R}^M(h) \in [-M, 0]$, while
the variance penalty
$\sqrt{Var_h(u)/n} \in \left[0, \frac{M}{2\sqrt{n}}\right]$. If $M$ is too
small, all hypotheses will have the same biased estimate of risk
$M \hat{R}^M(h_0)$, since all the importance sampling weights will be
clipped. Similarly, if $\lambda \gg 0$, a hypothesis $h \in \mathcal{H}$
that completely avoids $\mathcal{D}$ achieves the best possible training
objective of 0. As a rule of thumb, we can calibrate $M$ and $\lambda$ so
that the estimator is unbiased and the objective is negative for some
$h \in \mathcal{H}$. When $h_0 \in \mathcal{H}$,
$M \simeq \max\{p_i\} / \min\{p_i\}$ and $\lambda$ such that
$\hat{R}^M(h_0) + \lambda \sqrt{Var_{h_0}(u)/n} < 0$ are natural choices.
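The rule of thumb can be turned into a small calibration routine. The sketch below is one possible reading, not the authors' code: it sets $M$ to the max/min propensity ratio and returns the largest $\lambda$ for which the objective evaluated at the logging policy is still non-positive (any smaller $\lambda$ makes it strictly negative). It assumes losses are scaled to $[-1, 0]$, and uses the fact that under $h_0$ every importance weight equals 1, so the weighted losses are just the logged losses. The name `calibrate` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def calibrate(
    losses: Sequence[float], propensities: Sequence[float]
) -> Tuple[float, float]:
    """Return (clip, lam_star) from the logged losses and propensities.

    clip     : max(p_i) / min(p_i), which is always >= 1
    lam_star : the lambda at which mean + lambda * sqrt(var / n) equals 0
               when the objective is evaluated at the logging policy
    """
    n = len(losses)
    clip = max(propensities) / min(propensities)
    mean_u = sum(losses) / n  # weights are 1 under h_0
    var_u = sum((d - mean_u) ** 2 for d in losses) / (n - 1)
    if var_u == 0.0:
        return clip, math.inf  # no variance penalty to balance against
    lam_star = -mean_u / math.sqrt(var_u / n)
    return clip, lam_star
```

Candidate values of $\lambda$ can then be taken as fractions of `lam_star` and compared on held-out data.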


4.3. When is counterfactual learning possible? 

The bounds in Theorem 1 are with respect to the randomness in $h_0$. Known
impossibility results for counterfactual evaluation using $h_0$ (Langford
et al., 2008) also apply to counterfactual learning. In particular, if
$h_0$ was deterministic, or even stochastic but without full support over
$\mathcal{Y}$, it is easy to engineer examples involving the unexplored
$y \in \mathcal{Y}$ that guarantee sub-optimal learning even as
$|\mathcal{D}| \to \infty$. Also, a stochastic $h_0$ with heavier tails
need not always allow more effective learning. From importance sampling
theory (Owen, 2013), what really matters is how well $h_0$ explores the
regions of $\mathcal{Y}$ with favorable losses.


5. Learning Algorithm: POEM 

We now use the CRM principle to derive an efficient algorithm for
structured output prediction using linear rules. Classic models in
supervised learning (e.g., structured support vector machines
(Tsochantaridis et al., 2004) and conditional random fields (Lafferty et
al., 2001)) predict using

$$\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} \{ w \cdot \phi(x, y) \}, \qquad (6)$$

where $w$ is a $d$-dimensional weight vector, and $\phi(x, y)$ is a
$d$-dimensional joint feature map. For example, in multi-label document
classification, for a news article $x$ and a possible assignment of labels
$y$ represented as a bitvector, $\phi(x, y)$ could simply be a
concatenation of the bag-of-words features of the document ($x$), one copy
for each of the assigned labels in $y$, $x \otimes y$. Several efficient
inference algorithms have been developed to solve Equation (6).

Consider the following stochastic family $\mathcal{H}_{lin}$,
parametrized by $w$. A hypothesis $h_w(x) \in \mathcal{H}_{lin}$ samples
$y$ from the distribution

$$h_w(y \mid x) = \exp(w \cdot \phi(x, y)) / Z(x).$$

$Z(x) = \sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x, y'))$ is the
partition function. This can be thought of as the "soft-max" variant of
the "hard-max" rules from Equation (6). Additionally, for a temperature
multiplier $\alpha > 1$, $w \mapsto \alpha w$ induces a more "peaked"
distribution $h_{\alpha w}$ that preserves the modes of $h_w$, and
intuitively is a "more deterministic" variant of $h_w$.

$h_w$ lies in the exponential family of distributions, and has a simple
gradient,

$$\nabla h_w(y \mid x) = h_w(y \mid x) \left\{ \phi(x, y) - \mathbb{E}_{y' \sim h_w(x)} [\phi(x, y')] \right\}.$$
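The exponential-model policy and its gradient can be checked on a toy problem by brute-force enumeration of the output space. The sketch below assumes the joint feature map $x \otimes y$ (one copy of $x$ per assigned label) and a small number of labels so that all $2^q$ outputs can be listed; the function names are illustrative only, and enumeration is used purely for clarity, not as an efficient implementation.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple


def joint_features(x: Sequence[float], y: Sequence[int]) -> List[float]:
    """phi(x, y) = x (outer product) y, flattened: one block of x per label."""
    return [xi * yj for yj in y for xi in x]


def policy(w: Sequence[float], x: Sequence[float], num_labels: int) -> Dict[Tuple[int, ...], float]:
    """Return {y: h_w(y | x)} by enumerating all 2^q label vectors."""
    outputs = list(itertools.product((0, 1), repeat=num_labels))
    scores = [sum(wk * fk for wk, fk in zip(w, joint_features(x, y))) for y in outputs]
    top = max(scores)  # subtract the max score for numerical stability
    expo = [math.exp(s - top) for s in scores]
    z = sum(expo)  # partition function (up to the stabilizing factor)
    return {y: e / z for y, e in zip(outputs, expo)}


def policy_gradient(w: Sequence[float], x: Sequence[float], y: Tuple[int, ...], num_labels: int) -> List[float]:
    """Gradient of h_w(y | x) w.r.t. w: h_w(y|x) * (phi(x,y) - E_{y'}[phi(x,y')])."""
    probs = policy(w, x, num_labels)
    expected = [0.0] * len(w)
    for y_prime, prob in probs.items():
        for k, fk in enumerate(joint_features(x, y_prime)):
            expected[k] += prob * fk
    phi = joint_features(x, y)
    return [probs[y] * (phi[k] - expected[k]) for k in range(len(w))]
```

At $w = 0$ every output receives probability $1/2^q$, and comparing `policy_gradient` against a finite-difference approximation is a quick way to validate the formula.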

Consider a bandit-feedback structured-output dataset
$\mathcal{D} = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$.
In multi-label document classification, this data could be collected from
an interactive labeling system, where each $y$ indicates the labels
predicted by the system for a document $x$. The feedback $\delta(x, y)$ is
how many labels (but not which ones) were correct. To perform learning,
first we scale the losses as outlined in Section 4.1. Next, instantiating
the CRM principle (Equation (5)) for $\mathcal{H}_{lin}$ (using notation
analogous to that in Theorem 1, adapted for $\mathcal{H}_{lin}$) yields
the POEM training objective.

POEM Training Objective:

$$w^* = \operatorname{argmin}_{w \in \mathbb{R}^d} \; \bar{u}_w + \lambda \sqrt{\frac{Var_w(u)}{n}}, \qquad (7)$$

$$u_w^i \equiv \delta_i \min\left\{ M, \frac{\exp(w \cdot \phi(x_i, y_i))}{p_i \cdot Z(x_i)} \right\}, \qquad \bar{u}_w \equiv \sum_{i=1}^{n} u_w^i / n,$$

$$Var_w(u) \equiv \sum_{i=1}^{n} (u_w^i - \bar{u}_w)^2 / (n - 1).$$

While the objective in Equation (7) is not convex in $w$ (even for
$\lambda = 0$), prior work (Yu et al., 2010), (Lewis & Overton, 2013) has
established theoretically sound modifications to L-BFGS for non-smooth
non-convex optimization. We find that batch gradient descent (e.g., L-BFGS
out of the box) and the stochastic gradient approach introduced below find
local optima that have good generalization error.


Software implementing POEM is available at
http://www.cs.cornell.edu/~adith/poem/ for download, as is all the code
and data needed to run each of the experiments reported in Section 6.


5.1. Iterated Variance Majorization 

The POEM training objective in Equation (7), specifically the variance
term $\sqrt{Var_w(u)}$, resists stochastic gradient optimization in the
presented form. To remove this obstacle, we now develop a
Majorization-Minimization scheme, similar in spirit to recent approaches
to multi-class SVMs (van den Burg & Groenen, 2014), that can be shown to
converge to a local optimum of the POEM training objective. In particular,
we will show how to decompose $\sqrt{Var_w(u)}$ as a sum of differentiable
functions (e.g., $\sum_i u_w^i$ or $\sum_i \{u_w^i\}^2$) so that we can
optimize the overall training objective at scale using stochastic gradient
descent.

Proposition 1. For any $w_0$,

$$\sqrt{Var_w(u)} \le A_{w_0} \sum_{i=1}^{n} u_w^i + B_{w_0} \sum_{i=1}^{n} \{u_w^i\}^2 + C_{w_0} = Q(w; w_0),$$

$$A_{w_0} \equiv -\bar{u}_{w_0} / \{(n - 1) \sqrt{Var_{w_0}(u)}\},$$

$$B_{w_0} \equiv 1 / \{2 (n - 1) \sqrt{Var_{w_0}(u)}\},$$

$$C_{w_0} \equiv \frac{n \{\bar{u}_{w_0}\}^2}{2 (n - 1) \sqrt{Var_{w_0}(u)}} + \frac{\sqrt{Var_{w_0}(u)}}{2}.$$

Proof. Consider a first order Taylor approximation of $\sqrt{Var_w(u)}$
around $w_0$, noting that $\sqrt{\cdot}$ is concave. Again Taylor
approximate, noting that $-\{\bar{u}_w\}^2$ is concave. □

Iteratively minimizing $w^{t+1} = \operatorname{argmin}_w Q(w; w^t)$
ensures that the sequence of iterates $w^1, \ldots, w^{t+1}$ are
successive minimizers of $\sqrt{Var_w(u)}$. Hence, during an epoch $t$,
POEM proceeds by sampling uniformly $i \sim \mathcal{D}$, computing
$u_w^i$, $\nabla u_w^i$ and, for learning rate $\eta$, updating

$$w \leftarrow w - \eta \{ \nabla u_w^i + \lambda \sqrt{n} (A_{w^t} \nabla u_w^i + 2 B_{w^t} u_w^i \nabla u_w^i) \}.$$

After each epoch, $w^{t+1} \leftarrow w$, and iterated minimization
proceeds until convergence.
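The majorization in Proposition 1 can be verified numerically. The sketch below works directly on vectors of weighted losses $u$ (rather than on weight vectors $w$): it computes the three coefficients at a reference point and evaluates the resulting upper bound, which should be tight at the reference point and never fall below the true standard deviation elsewhere. It assumes the reference variance is strictly positive; all names are illustrative.

```python
import math
from typing import Sequence, Tuple


def mean_and_var(u: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and (n - 1)-normalized sample variance of u."""
    n = len(u)
    mean = sum(u) / n
    var = sum((ui - mean) ** 2 for ui in u) / (n - 1)
    return mean, var


def majorizer_coefficients(u0: Sequence[float]) -> Tuple[float, float, float]:
    """Coefficients (A, B, C) evaluated at reference weighted losses u0."""
    n = len(u0)
    mean0, var0 = mean_and_var(u0)
    std0 = math.sqrt(var0)  # assumed strictly positive
    a = -mean0 / ((n - 1) * std0)
    b = 1.0 / (2.0 * (n - 1) * std0)
    c = n * mean0 ** 2 / (2.0 * (n - 1) * std0) + std0 / 2.0
    return a, b, c


def majorizer(u: Sequence[float], coeffs: Tuple[float, float, float]) -> float:
    """Upper bound A * sum(u) + B * sum(u^2) + C on sqrt(Var(u))."""
    a, b, c = coeffs
    return a * sum(u) + b * sum(ui * ui for ui in u) + c


def per_sample_gradient_scale(u_i: float, coeffs: Tuple[float, float, float], lam: float, n: int) -> float:
    """Scalar multiplying grad(u_i) in the stochastic update:
    1 + lam * sqrt(n) * (A + 2 * B * u_i)."""
    a, b, _ = coeffs
    return 1.0 + lam * math.sqrt(n) * (a + 2.0 * b * u_i)
```

Because the bound decomposes into per-sample terms, each stochastic step only needs the current sample's $u_w^i$ and its gradient, with the coefficients held fixed for the duration of an epoch.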

6. Experiments 

We now empirically evaluate the prediction performance and computational
efficiency of POEM. Consider multi-label classification with input
$x \in \mathbb{R}^p$ and prediction $y \in \{0, 1\}^q$. Popular supervised
algorithms that solve this problem include Structured SVMs (Tsochantaridis
et al., 2004) and Conditional Random Fields (Lafferty et al., 2001). In
the simplest case, CRF essentially performs logistic regression for each
of the $q$ labels independently. As outlined in Section 5, we use a joint
feature map: $\phi(x, y) = x \otimes y$. We conducted experiments on
different multi-label datasets collected from the LibSVM repository, with
different ranges for $p$ (features), $q$ (labels) and $n$ (samples), as
summarized in Table 2.

Table 2. Corpus statistics for different multi-label datasets from the
LibSVM repository. LYRL was post-processed so that only top level
categories were treated as labels.

| Name | $p$ (# features) | $q$ (# labels) | $n_{train}$ | $n_{test}$ |
|---|---|---|---|---|
| Scene | 294 | 6 | 1211 | 1196 |
| Yeast | 103 | 14 | 1500 | 917 |
| TMC | 30438 | 22 | 21519 | 7077 |
| LYRL | 47236 | 4 | 23149 | 781265 |

Experiment methodology. We employ the Supervised $\mapsto$ Bandit
conversion (Agarwal et al., 2014) method. Here, we take a supervised
dataset $\mathcal{D}^* = \{(x_1, y_1^*), \ldots, (x_n, y_n^*)\}$ and
simulate a bandit feedback dataset from a logging policy $h_0$ by sampling
$y_i \sim h_0(x_i)$ and collecting feedback $\Delta(y_i^*, y_i)$. In
principle, we could use any arbitrary stochastic policy as $h_0$. We
choose a CRF trained on 5% of $\mathcal{D}^*$ as $h_0$ using default
hyper-parameters, since they provide probability distributions amenable to
sampling. In all the multi-label experiments, $\Delta(y^*, y)$ is the
Hamming loss between the supervised label $y^*$ vs. the sampled label $y$
for input $x$. Hamming loss is just the number of incorrectly assigned
labels (both false positives and false negatives). To create bandit
feedback
$\mathcal{D} = \{(x_i, y_i, \delta_i \equiv \Delta(y_i^*, y_i), p_i \equiv h_0(y_i \mid x_i))\}$,
we take four passes through $\mathcal{D}^*$ and sample labels from $h_0$.
Note that each supervised label is worth $\simeq |\mathcal{Y}| = 2^q$
bandit feedback labels. We can explore different learning strategies
(e.g., IPS, CRM, etc.) on $\mathcal{D}$ and obtain learnt weight vectors
$w_{ips}$, $w_{crm}$, etc. On the supervised test set, we then report the
expected loss per instance
$R(w) = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \mathbb{E}_{y \sim h_w(x_i)} [\Delta(y_i^*, y)]$
and compare the generalization performance of these learning strategies.
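The supervised-to-bandit conversion can be sketched as a short simulation. To keep the example self-contained, it assumes a hypothetical logging policy that samples each label independently with a per-label probability (a stand-in for the CRF used in the experiments, supplied by the caller as `label_probs`); each log entry records the input, the sampled labels, the Hamming loss against the true labels, and the propensity of the sampled labels.

```python
import random
from typing import Callable, List, Sequence, Tuple

LogEntry = Tuple[Sequence[float], Tuple[int, ...], int, float]


def hamming_loss(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """Number of labels on which the two bitvectors disagree."""
    return sum(1 for a, b in zip(y_true, y_pred) if a != b)


def sample_bandit_log(
    dataset: Sequence[Tuple[Sequence[float], Sequence[int]]],
    label_probs: Callable[[Sequence[float]], Sequence[float]],
    passes: int = 4,
    seed: int = 0,
) -> List[LogEntry]:
    """Simulate logged bandit feedback from a supervised dataset.

    dataset     : list of (x, y_star) pairs with full-information labels
    label_probs : hypothetical logging policy; maps x to P(label_j = 1 | x)
    passes      : number of passes over the supervised dataset
    """
    rng = random.Random(seed)
    log: List[LogEntry] = []
    for _ in range(passes):
        for x, y_star in dataset:
            probs = label_probs(x)
            y = tuple(1 if rng.random() < pj else 0 for pj in probs)
            # propensity of the sampled y under the factorized policy
            propensity = 1.0
            for yj, pj in zip(y, probs):
                propensity *= pj if yj == 1 else 1.0 - pj
            log.append((x, y, hamming_loss(y_star, y), propensity))
    return log
```

The raw Hamming losses recorded here lie in $[0, q]$ and would still need to be rescaled to $[-1, 0]$ before being used in a training objective.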

Baselines and learning methods. The expected Hamming loss of $h_0$ is the
baseline to beat. Lower loss is better. The naive, variance-agnostic
approach to counterfactual learning (Bottou et al., 2013) can be
generalized to handle parametric multi-label classification (Equation (7)
with $\lambda = 0$). We optimize it either using L-BFGS (IPS(B)) or
stochastic optimization (IPS(S)). POEM(S) uses our Iterative-Majorization
approach to variance regularization as outlined in Section 5.1, while
POEM(B) is an L-BFGS variant. Finally, we report results from a supervised
CRF as a skyline, despite its unfair advantage of having access to the
full-information examples.

We keep aside 25% of $\mathcal{D}$ as a validation set - we use the
unbiased counterfactual estimator from Equation (1) for selecting
hyper-parameters. $\lambda = c \lambda^*$, where $\lambda^*$ is the
calibration factor from Section 4.2 and $c \in [10^{-6}, \ldots, 1]$ in
multiples of 10. The clipping constant $M$ is similarly set to the ratio
of the 90th percentile to the 10th percentile propensity score observed in
the training set of $\mathcal{D}$. For all methods, when optimizing any
objective over $w$, we always begin the optimization from $w = 0$
($\Rightarrow h_w = \text{uniform}(\mathcal{Y})$). We use mini-batch
AdaGrad (Duchi et al., 2011) with batch size = 100 to adapt our learning
rates for the stochastic approaches and use progressive validation (Blum
et al., 1999) and gradient norms to detect convergence. Finally, the
entire experiment set-up is run 10 times (i.e. $h_0$ trained on randomly
chosen 5% subsets, $\mathcal{D}$ re-created, and test set performance of
different approaches collected) and we report the averaged test set
expected error across runs.
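The hyper-parameter choices in this paragraph can be sketched as two small helpers. The percentile convention is not specified in the text, so the sketch assumes the default ("exclusive") interpolation of Python's `statistics.quantiles`; the smallest exponent of the grid is left as a caller-supplied argument, and both function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def clipping_constant(propensities: Sequence[float]) -> float:
    """Ratio of the 90th to the 10th percentile of the logged propensities."""
    deciles = statistics.quantiles(propensities, n=10)  # 9 cut points
    return deciles[8] / deciles[0]


def lambda_grid(lam_star: float, min_exponent: int) -> List[float]:
    """Candidate values c * lam_star with c = 10^min_exponent, ..., 10^0."""
    return [lam_star * 10.0 ** e for e in range(min_exponent, 1)]
```

Each candidate on the grid would then be scored on the held-out portion of the log, and the best-scoring value kept.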

6.1. Does variance regularization improve 
generalization? 

Results are reported in Table 3. We statistically test the performance of
POEM against IPS (batch variants are paired together, and the stochastic
variants are paired together) using a one-tailed paired difference t-test
at significance level of 0.05 across 10 runs of the experiment, and find
POEM to be significantly better than IPS on each dataset and each
optimization variant. Furthermore, on all datasets POEM learns a
hypothesis that substantially improves over the performance of $h_0$. This
suggests that the CRM principle is practically useful for designing
learning algorithms, and that the variance regularizer is indeed
beneficial.

Table 3. Test set Hamming loss for different approaches to multi-label
classification on different datasets, averaged over 10 runs. POEM is
significantly better than IPS on each dataset and each optimization
variant (one-tailed paired difference t-test at significance level of
0.05).

| | Scene | Yeast | TMC | LYRL |
|---|---|---|---|---|
| $h_0$ | 1.543 | 5.547 | 3.445 | 1.463 |
| IPS(B) | 1.193 | 4.635 | 2.808 | 0.921 |
| POEM(B) | 1.168 | 4.480 | 2.197 | 0.918 |
| IPS(S) | 1.519 | 4.614 | 3.023 | 1.118 |
| POEM(S) | 1.143 | 4.517 | 2.522 | 0.996 |
| CRF | 0.659 | 2.822 | 1.189 | 0.222 |

6.2. How computationally efficient is POEM? 

Table 4 shows the time taken (in CPU seconds) to run each method on each
dataset, averaged over different validation runs when performing
hyper-parameter grid search. Some of the timing results are skewed by
outliers, e.g., when under very weak regularization, CRFs tend to take a
lot longer to converge. In aggregate, it is clear that the stochastic
variants are able to recover good parameter settings in a fraction of the
time of batch L-BFGS optimization, and this is even more pronounced when
the number of labels grows (the run-time is dominated by computation of
$Z(x_i)$).

Table 4. Average time in seconds for each validation run for different
approaches to multi-label classification. CRF is the scikit-learn
implementation (Pedregosa et al., 2011). On all datasets, stochastic
approaches are substantially faster than batch gradients.

| | Scene | Yeast | TMC | LYRL |
|---|---|---|---|---|
| IPS(B) | 2.58 | 47.61 | 136.34 | 21.01 |
| IPS(S) | 1.65 | 2.86 | 49.12 | 13.66 |
| POEM(B) | 75.20 | 94.16 | 949.95 | 561.12 |
| POEM(S) | 4.71 | 5.02 | 276.13 | 120.09 |
| CRF | 4.86 | 3.28 | 99.18 | 62.93 |

6.3. Can MAP predictions derived from stochastic policies perform well?

For the policies learnt by POEM as shown in Table 3, Table 5 reports the
averaged performance of the deterministic predictor derived from them. For
a learnt weight vector $w$, this simply amounts to applying Equation (6).
In practice, this method of generating predictions can be substantially
faster than sampling since computing the argmax does not require
computation of the partition function $Z(x)$, which can be expensive in
structured output prediction. From Table 5, we see that the loss of the
deterministic predictor is typically close to the loss of the stochastic
policy, and often slightly better.
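For the joint feature map $\phi(x, y) = x \otimes y$ used in these experiments, the score $w \cdot \phi(x, y)$ is a sum of independent per-label terms, so the argmax can be read off label by label without touching the partition function. The sketch below illustrates this under that assumption, with the weight vector stored as one block per label; the brute-force variant is included only as a cross-check for small $q$, and all names are illustrative.

```python
import itertools
from typing import List, Sequence, Tuple


def label_scores(w_blocks: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Per-label scores w_j . x, where w_blocks[j] is the block of w for label j."""
    return [sum(wk * xk for wk, xk in zip(block, x)) for block in w_blocks]


def map_prediction(w_blocks: Sequence[Sequence[float]], x: Sequence[float]) -> Tuple[int, ...]:
    """Argmax over y of sum_j y_j * (w_j . x): switch on labels with positive score."""
    return tuple(1 if s > 0 else 0 for s in label_scores(w_blocks, x))


def brute_force_map(w_blocks: Sequence[Sequence[float]], x: Sequence[float]) -> Tuple[int, ...]:
    """Reference implementation enumerating all 2^q outputs (small q only)."""
    scores = label_scores(w_blocks, x)
    candidates = itertools.product((0, 1), repeat=len(scores))
    return max(candidates, key=lambda y: sum(s * yj for s, yj in zip(scores, y)))
```

The label-wise rule costs one dot product per label, whereas sampling from the policy in a general structured output space requires normalizing over all outputs.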

Table 5. Mean Hamming loss of MAP predictions from the policies in Table
3. POEM(S)$_{map}$ is not significantly worse than POEM(S) (one-sided
paired difference t-test, significance level 0.05).

| | Scene | Yeast | TMC | LYRL |
|---|---|---|---|---|
| POEM(S) | 1.143 | 4.517 | 2.522 | 0.996 |
| POEM(S)$_{map}$ | 1.143 | 4.065 | 2.299 | 0.880 |

6.4. How does generalization improve with size of $\mathcal{D}$?

Figure 1. Generalization performance of POEM(S) as a function of $n$ on
the Yeast dataset. Even with ReplayCount = $2^8$, POEM(S) is learning from
much less information than the CRF (each supervised label conveys $2^{14}$
bandit label feedbacks).

As we collect more data under $h_0$, our generalization error
bound indicates that prediction performance should eventually
approach that of the optimal hypothesis in the hypothesis
space. We can simulate $n \to \infty$ by replaying the
training data multiple times, collecting samples $y \sim h_0(x)$.
In the limit, we would observe every possible $y$ in the bandit
feedback dataset, since $h_0(x)$ has non-zero probability
of exploring each prediction $y$. However, the learning rate
may be slow, since the exponential model family has very
thin tails, and hence may not be an ideal logging distribution
to learn from. Holding all other details of the experiment
setup fixed, we vary the number of times we replay
the training set (ReplayCount) to collect samples
from $h_0$, and report the performance of POEM(S) on the
Yeast dataset in Figure 1.
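
The replay procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the logging policy is taken to be a factorized exponential model over binary labels (one sigmoid per label), the loss is the Hamming loss, and `replay_bandit_data` and its arguments are hypothetical names. The data here is random and only exercises the code.

```python
import numpy as np

def replay_bandit_data(X, Y_true, W0, replay_count, rng):
    """Replay the supervised set `replay_count` times, sampling y ~ h_0(x) on each pass.

    Returns (x_index, y_sampled, loss, propensity), one entry per logged interaction.
    """
    idx, ys, losses, props = [], [], [], []
    for _ in range(replay_count):
        for i, (x, y_star) in enumerate(zip(X, Y_true)):
            q = 1.0 / (1.0 + np.exp(-(W0 @ x)))        # P(y_k = 1 | x) under h_0
            y = (rng.random(q.shape) < q).astype(int)  # y ~ h_0(x)
            p = np.prod(np.where(y == 1, q, 1.0 - q))  # propensity h_0(y | x)
            idx.append(i)
            ys.append(y)
            losses.append(int(np.sum(y != y_star)))    # feedback for the sampled y only
            props.append(p)
    return np.array(idx), np.array(ys), np.array(losses), np.array(props)

rng = np.random.default_rng(0)
n, d, K = 50, 5, 3
X = rng.normal(size=(n, d))
Y_true = rng.integers(0, 2, size=(n, K))
W0 = rng.normal(size=(K, d))

for replay_count in (1, 4, 16):
    idx, ys, losses, props = replay_bandit_data(X, Y_true, W0, replay_count, rng)
    seen = {(i, tuple(y)) for i, y in zip(idx, ys)}
    coverage = len(seen) / (n * 2 ** K)  # fraction of (x, y) pairs observed at least once
    print(replay_count, len(idx), round(coverage, 3))
```

Each supervised example reveals the loss of all $2^K$ outputs at once, whereas each logged interaction reveals it for a single sampled output. The `coverage` figure tracks how slowly replaying closes that gap.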

6.5. How does quality of $h_0$ affect learning?

In this experiment, we change the fraction $f$ of the training
set (i.e., $f \cdot n_{train}$ examples) that was used to train the
logging policy; as $f$ is increased, the quality of $h_0$ improves.
Intuitively, there is a trade-off: a better $h_0$ probably samples
correct predictions more often and so produces a higher quality
$\mathcal{D}$ to learn from, but it should also be harder to beat $h_0$.
We vary $f$ from 1% to 100% in Figure 2 while keeping all other
conditions identical to the original experiment setup, and find
that POEM(S) is able to consistently find a hypothesis at
least as good as $h_0$. Moreover, even $\mathcal{D}$ collected from a
poor quality $h_0$ ($0.5 < f < 0.2$) allows POEM(S) to effectively
learn an improved policy.

Figure 2. Performance of POEM(S) on the Yeast dataset as $h_0$ is
improved. The fraction $f$ of the supervised training set used to
train $h_0$ is varied to control the quality of $h_0$. The performance
of $h_0$ does not reach that of the CRF when $f = 1$ because we do not
tune hyper-parameters, and we report its expected loss, not the
loss of its MAP prediction.

6.6. How does stochasticity of $h_0$ affect learning?

Finally, the theory suggests that counterfactual learning is
only possible when $h_0$ is sufficiently stochastic (the
generalization bounds hold with high probability in the samples
drawn from $h_0$). Does CRM degrade gracefully when
this assumption is violated? We test this by introducing
the temperature multiplier $w \mapsto \alpha w$, $\alpha > 0$ (as discussed
in Section 5) into the logging policy. For $h_0 = h_{w_0}$, we
scale $w_0 \mapsto \alpha w_0$ to derive a “more deterministic” variant
of $h_0$, and generate $\mathcal{D} \sim h_{\alpha w_0}$. We report the
performance of POEM(S) on the LYRL dataset in Figure 3 as
we change $\alpha \in [0.5, \ldots, 32]$, compared against $h_0$ and the
deterministic predictor derived from it, $h_0$ map. So
long as there is some minimum amount of stochasticity in
$h_0$, POEM(S) is still able to find a $w$ that improves upon
$h_0$ and $h_0$ map. The margin of improvement is typically
greater when $h_0$ is more stochastic. Even when $h_0$ is too
deterministic ($\alpha \ge 2^5$), the performance of POEM(S) simply
recovers that of $h_0$ map, suggesting that the CRM principle
indeed achieves robust learning.
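
The effect of the temperature multiplier can be seen in a small sketch. It assumes the same factorized exponential model over binary labels as an illustrative stand-in for the logging policy; `label_probs` and `entropy_bits` are hypothetical helper names, and the weights are random. Scaling $w_0$ by $\alpha$ leaves the MAP prediction unchanged but concentrates probability mass on it.

```python
import numpy as np

def label_probs(W, x, alpha=1.0):
    """P(y_k = 1 | x) under h_{alpha * w} for the assumed factorized multi-label model."""
    return 1.0 / (1.0 + np.exp(-alpha * (W @ x)))

def entropy_bits(q):
    """Entropy in bits of independent Bernoulli labels with probabilities q."""
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return float(-np.sum(q * np.log2(q) + (1 - q) * np.log2(1 - q)))

rng = np.random.default_rng(0)
K, d = 6, 4
W0 = rng.normal(size=(K, d))
x = rng.normal(size=d)
y_map = (W0 @ x > 0).astype(int)  # MAP prediction; the same for every alpha > 0

alphas = [0.5, 1, 2, 4, 8, 16, 32]
entropies, map_probs = [], []
for alpha in alphas:
    q = label_probs(W0, x, alpha)
    entropies.append(entropy_bits(q))
    # Probability that a sample from h_{alpha * w0} equals the MAP prediction.
    map_probs.append(float(np.prod(np.where(y_map == 1, q, 1 - q))))
    print(alpha, round(entropies[-1], 4), round(map_probs[-1], 4))
```

As $\alpha$ grows, the entropy of the policy shrinks and the logged data contains fewer and fewer predictions other than the MAP output, which is the regime where the propensity-weighted estimator has little to work with.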



Figure 3. Performance of POEM(S) on the LYRL dataset as $h_0$
becomes more deterministic. For $\alpha \ge 2^5$, $h_0$ is identical to
$h_0$ map (within machine precision).

We observe the same trends (Figures 1, 2 and 3) across
all datasets and optimization variants. They also remain
unchanged when we include $\ell_2$-regularization (analogous
to supervised CRFs, to capture the capacity of $\mathcal{H}_{lin}$).

7. Conclusion 

Counterfactual risk minimization serves as a robust principle
to design algorithms that can learn from a batch of
bandit feedback interactions. The key insight for CRM is
to expand the classical notion of a hypothesis class to include
stochastic policies, reason about variance in the risk
estimator, and derive a generalization error bound over this
hypothesis space. The practical take-away is a simple,
data-dependent regularizer that guarantees robust learning.
Following the CRM principle, we developed POEM for structured
output prediction. POEM can optimize over rich policy
families (exponential models corresponding to linear
rules in supervised learning), and deal with massive output
spaces as efficiently as classical supervised methods.
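
The data-dependent regularizer can be sketched in a few lines. This is a minimal illustration, assuming the objective is a clipped propensity-weighted risk estimate plus $\lambda \sqrt{\widehat{Var}/n}$. The function name `crm_objective`, the defaults `clip_M` and `lam`, and the toy numbers are all hypothetical and chosen only to exercise the code.

```python
import numpy as np

def crm_objective(losses, new_probs, logged_props, clip_M=10.0, lam=1.0):
    """Clipped propensity-weighted risk plus a sample-variance penalty.

    losses       : delta_i, observed loss of the logged prediction y_i
    new_probs    : h_w(y_i | x_i), probability of y_i under the candidate policy
    logged_props : p_i = h_0(y_i | x_i), propensities recorded by the logging policy
    """
    u = losses * np.minimum(clip_M, new_probs / logged_props)  # per-sample weighted loss
    n = len(u)
    ips_risk = u.mean()
    variance = u.var(ddof=1)                                   # unbiased sample variance
    return ips_risk + lam * np.sqrt(variance / n), ips_risk, variance

# Toy log of four interactions (illustrative values only).
losses = np.array([1.0, 0.0, 2.0, 1.0])
logged = np.array([0.5, 0.25, 0.1, 0.5])

# Candidate identical to the logging policy: all importance weights equal 1.
obj_same, risk_same, var_same = crm_objective(losses, logged, logged)

# Candidate that puts all its mass on a rarely logged, high-loss prediction.
peaked = np.array([0.5, 0.25, 1.0, 0.5])
obj_peak, risk_peak, var_peak = crm_objective(losses, peaked, logged, clip_M=5.0)

print(obj_same, risk_same, var_same)
print(obj_peak, risk_peak, var_peak)
```

The penalty term grows for candidates whose importance weights vary widely, so two policies with similar estimated risk are ranked by how reliable that estimate is.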

The CRM principle more generally applies to supervised
learning with non-differentiable losses, since the objective
does not require the gradient of the loss function. We also
foresee extensions of this work that relax some of the
assumptions, e.g., to handle noisy $\delta(\cdot, \cdot)$, ordinal or
co-active feedback, or an adaptive $h_0$.
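
To see why no gradient of the loss is needed, consider the unclipped propensity-weighted risk estimate, written here with $\delta_i$ for the observed loss and $p_i$ for the logged propensity (a sketch that ignores clipping and the variance term). The loss enters only as a scalar coefficient:

```latex
\nabla_w \hat{R}(h_w)
  = \frac{1}{n} \sum_{i=1}^{n} \frac{\delta_i}{p_i} \, \nabla_w h_w(y_i \mid x_i)
  = \frac{1}{n} \sum_{i=1}^{n} \delta_i \, \frac{h_w(y_i \mid x_i)}{p_i} \, \nabla_w \log h_w(y_i \mid x_i)
```

Only $\log h_w$ is differentiated, so $\delta$ may be any bounded loss, including non-differentiable ones such as the Hamming loss.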